{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RE使用大致步骤\n",
    "1. 使用compilc将表示正则的字符串编译为一个pattern对象\n",
    "2. 通过pattern对象提供一系列方法对文本进行查找匹配，获得匹配结果，一个Match对象\n",
    "3. 最后使用Match对象提供的属性和方法获得信息，根据需要进行操作\n",
    "\n",
    "# RE常用函数\n",
    "- group()：获得一个或者多个分组匹配的字符串，当要获得整个匹配的子串时，直接使用group或者group（0）\n",
    "- start：获取分组匹配的子串在整个字符串中的起始位置，参数默认为0\n",
    "- end：获取分组匹配的子串在整个字符串中的结束位置，参数默认为0\n",
    "- span：返回的结构技术(stant(group), end(group))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(3, 5), match='12'>\n"
     ]
    }
   ],
   "source": [
    "# 导入相关包\n",
    "import re\n",
    "\n",
    "# 查找数字\n",
    "# r表示字符串不转义\n",
    "p = re.compile(r'\\d+')\n",
    "# 在字符串“one12towthree33456four78”中进行查找，按照规则p制定的正则进行查找\n",
    "# 返回结构是None表示没找到，否则返回match对象\n",
    "# 参数3，6表示在字符串中查找的范围\n",
    "m = p.match(\"one12towthree33456four78\", 3, 26)\n",
    "\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上述代码说明：\n",
    "1. match可以输入参数表示起始位置\n",
    "2. 查找到的结果只包含一个，表示第一次进行匹配成功的内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "12\n",
      "3\n",
      "5\n"
     ]
    }
   ],
   "source": [
    "print(m[0])\n",
    "print(m.start(0))\n",
    "print(m.end(0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 4), match='I am'>\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "# I表示忽略大小写\n",
    "p = re.compile(r'([a-z]+) ([a-z]+)', re.I)\n",
    "\n",
    "m = p.match(\"I am really love wangxiaojing \")\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I am\n",
      "0\n",
      "4\n"
     ]
    }
   ],
   "source": [
    "print(m.group(0))\n",
    "print(m.start(0))\n",
    "print(m.end(0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "am\n",
      "2\n",
      "4\n"
     ]
    }
   ],
   "source": [
    "print(m.group(2))\n",
    "print(m.start(2))\n",
    "print(m.end(2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('I', 'am')\n"
     ]
    }
   ],
   "source": [
    "print(m.groups())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 查找\n",
    "- search(str, [, pos[, endpos]]):在字符串中查找匹配，pos和endpos表示起始位置\n",
    "- findall：查找所有\n",
    "- finditer：查找，返回一个iter结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "12\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "p = re.compile(r'\\d+')\n",
    "m = p.search(\"one12towthree33456four78\")\n",
    "\n",
    "print(m.group())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'list'>\n",
      "['12', '33456', '78']\n"
     ]
    }
   ],
   "source": [
    "rst = p.findall(\"one12towthree33456four78\")\n",
    "\n",
    "print(type(rst))\n",
    "print(rst)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# sub替换\n",
    "- sub(rep1, str[,count])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello world Hello world xiaojing, Hello world you\n"
     ]
    }
   ],
   "source": [
    "# sub替换案例\n",
    "import re\n",
    "\n",
    "p = re.compile(r'(\\w+) (\\w+)')\n",
    "s = \"hello 123 wang 465 xiaojing, i love you\"\n",
    "rst = p.sub(r'Hello world', s)\n",
    "\n",
    "print(rst)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 匹配中文\n",
    "- 大部分中文表示范围是[u4e00-u9fa5], 不包括全角标点"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['世界', '你好']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "title = u'世界 你好， Hello moto'\n",
    "\n",
    "p = re.compile(r'[\\u4e00-\\u9fa5]+')\n",
    "rst = p.findall(title)\n",
    "\n",
    "print(rst)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 贪婪和非贪婪\n",
    "- 贪婪：尽可能多的匹配，（*）表示贪婪匹配\n",
    "- 非贪婪：找到符合条件的最小内容即可，(?)表示非贪婪\n",
    "- 正则默认使用贪婪匹配"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<div>name</div><div>age</div>\n",
      "<div>name</div>\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "title = u'<div>name</div><div>age</div>'\n",
    "\n",
    "p1 = re.compile(r\"<div.*</div>\")\n",
    "p2 = re.compile(r\"<div.*?</div>\")\n",
    "\n",
    "m1 = p1.search(title)\n",
    "print(m1.group())\n",
    "\n",
    "m2 = p2.search(title)\n",
    "print(m2.group())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
